Subhojyoti Mukherjee

I am a Ph.D. candidate in the Department of Electrical and Computer Engineering (ECE), University of Wisconsin-Madison.

I am looking for full-time positions in industry.

Download CV
Email: smukherjee27 [at] wisc [dot] edu

Statement and Vision

My broader vision is to build large-scale, trustworthy language, vision, and machine learning models. To that end, I have studied adaptive data collection strategies for LLM training and methods for aligning LLMs with human feedback by collecting informative data. Building such models is challenging, and my past work has examined several aspects of data collection for training them:

  1. Adaptive Data collection in Reinforcement Learning
  2. Understanding in-context learning for Decision Transformers
  3. Adaptive prompt design for LLMs, and aligning LLMs with human preferences through finetuning
  4. Safety in Machine Learning

My expertise spans algorithm research and development, training machine learning models, Reinforcement Learning, fine-tuning LLMs, and prompt design for LLMs. This expertise is crucial for building large-scale, real-life systems that interact with users and learn user preferences from data.

Education

Ph.D. candidate
(Fall 2019 to Fall 2024 expected)
at ECE, University of Wisconsin-Madison
advised by Dr. Robert Nowak, Dr. Josiah Hanna, and Dr. Qiaomin Xie

Areas of Research: Reinforcement Learning, Active Learning, incorporating deep active learning strategies for Large Language Models (LLMs), aligning Large Language Models with human feedback (RLHF), and understanding sequential decision-making using transformers (DT).

(Joint) Master's Thesis: Active Sequential Hypothesis Testing with Extension to Active Regression and Multi-armed Bandits pdf
M.S. by Research
(2015 to 2018)
at CSE, Indian Institute of Technology (IIT) Madras
advised by Dr. Balaraman Ravindran, and Dr. Nandan Sudarsanam
RISE Lab

Areas of Research: Reinforcement learning, Stochastic and non-stochastic Multi-Armed Bandit settings.

Master's Thesis: Finite-time Analysis of Frequentist Strategies for Multi-armed Bandits pdf
Bachelor of Technology
(2009 to 2013)
at Dept. of Computer Science and Engineering
Meghnad Saha Institute of Technology, Kolkata
under West Bengal University of Technology, India

Research Internships

Amazon AWS AI, Santa Clara, USA
Summer 2024 (Full-time)
hosted by Branislav Kveton, Anusha Lalitha
and: Sailik Sengupta, Yifei Ma, Aniket Deshmukh, Gaurush Hiranandani.

Area of Research: Multi-objective alignment for LLMs.
Amazon AWS AI, Santa Clara, USA
Fall 2023 (Part-time)
hosted by Branislav Kveton
and: Yifei Ma, Anusha Lalitha, Kousha Kalantiri, Ge Liu, Aniket Deshmukh, Anoop Deoras.

Area of Research: RLHF with LLMs.
Amazon AWS AI, Santa Clara, USA
Summer 2023 (Full-time)
hosted by Branislav Kveton
and: Yifei Ma, Anusha Lalitha, Ge Liu, Aniket Deshmukh, Anoop Deoras.

Area of Research: Active In-Context Learning with LLMs.
CMU, ECE Dept., Pittsburgh, USA
Summer 2019
hosted by Prof. Gauri Joshi
Area of Research: Structured Bandits.
Adobe Research, San Jose, USA
Spring 2018
hosted by Branislav Kveton
Area of Research: Item recommendation with Ranking and Bandits.
INRIA, SequeL Lab, Lille, France
Fall 2017
hosted by Odalric Maillard
Area of Research: Non-stationary Bandits.

Research Focus and Selected works

LLMs, RLHF, and Prompt Design

Multi-Objective Alignment of LLMs

Multi-Objective Alignment of Large Language Models Through Hypervolume Maximization

Multi-objective alignment from human feedback (MOAHF) in large language models (LLMs) is a challenging problem as human preferences are complex, multifaceted, and often conflicting. Recent works on MOAHF considered a-priori multi-objective optimization (MOO), where human preferences are known at training or inference time. In contrast, when human preferences are unknown or difficult to quantify, a natural approach is to cover the Pareto front by multiple diverse solutions. We propose an algorithm HaM for learning diverse LLM policies that maximizes their hypervolume. This is the first application of a-posteriori MOO to MOAHF. HaM is computationally and space efficient, and empirically superior across objectives such as harmlessness, helpfulness, humor, faithfulness, and hallucination, on various datasets. pdf
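The hypervolume objective mentioned above can be illustrated with a small sketch. This is not HaM itself: it only computes the hypervolume of a set of two-objective scores, assuming both objectives are maximized and measured against a reference point. The function name `hypervolume_2d` and the example scores are hypothetical.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` and bounded below by `ref`.

    Assumes two objectives, both to be maximized (an illustrative setup,
    not the setting of any particular paper).
    """
    # Keep only points that strictly improve on the reference in both objectives.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            # Add the horizontal slab this point contributes above best_y.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Hypothetical (helpfulness, harmlessness) scores of three policies.
diverse = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
clustered = [(2.0, 2.0), (2.0, 1.9), (1.9, 2.0)]
print(hypervolume_2d(diverse), hypervolume_2d(clustered))
```

A spread-out set of policies covers more of the objective space than a clustered one, which is why maximizing hypervolume encourages diverse solutions along the Pareto front.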

Optimal Design for RLHF

Optimal Design for Human Feedback for Training Large Language Models (NeurIPS 2024 main conference)

We study the problem of data collection for learning preference models. The key idea in our work is to generalize the optimal design, a method for computing information gathering policies, to ranked lists. We design efficient algorithms and experiment with several synthetic and real-world datasets to show the statistical efficiency of our algorithms. pdf
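For readers unfamiliar with optimal design, the following is a minimal sketch of the classical D-optimal design over individual items, computed with a Frank-Wolfe iteration. It is not the ranked-list generalization from the paper. To stay dependency-free it assumes 2-D feature vectors that span the plane; all function names are hypothetical.

```python
def info_matrix(arms, weights):
    """Weighted information matrix sum_i w_i x_i x_i^T for 2-D arms.

    Returned as (a, b, c), representing the symmetric matrix [[a, b], [b, c]].
    """
    a = b = c = 0.0
    for (x, y), w in zip(arms, weights):
        a += w * x * x
        b += w * x * y
        c += w * y * y
    return a, b, c


def leverage(arm, mat):
    """x^T A^{-1} x for a 2-D arm x and symmetric A = [[a, b], [b, c]]."""
    a, b, c = mat
    det = a * c - b * b
    x, y = arm
    return (c * x * x - 2.0 * b * x * y + a * y * y) / det


def d_optimal_design(arms, iters=200):
    """Frank-Wolfe iteration for the D-optimal design (maximizes det of A)."""
    d = 2  # feature dimension assumed by this sketch
    n = len(arms)
    w = [1.0 / n] * n  # start from the uniform design
    for _ in range(iters):
        mat = info_matrix(arms, w)
        scores = [leverage(x, mat) for x in arms]
        j = max(range(n), key=scores.__getitem__)
        g = scores[j]
        if g <= d + 1e-12:
            break  # Kiefer-Wolfowitz: the design is optimal when max leverage equals d
        step = (g / d - 1.0) / (g - 1.0)  # exact line search for log det
        w = [(1.0 - step) * wi for wi in w]
        w[j] += step
    return w


arms = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
print(d_optimal_design(arms))
```

In this toy example the third arm is an average of the first two, so it is less informative and the design shifts sampling weight away from it.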

Performance of Logged Feedback in LLMs

Off-Policy Evaluation from Logged Human Feedback using Large Language Models (ICML 2024 Workshop)

We study off-policy evaluation from logged human feedback. We formalize the problem, propose both model-based and model-free estimators for policy values, and show how to optimize them. We analyze the unbiasedness of our estimators and evaluate them empirically with Large Language Models. Our estimators can predict the absolute values of evaluated policies, rank them, and be optimized. pdf
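As background, a standard model-free off-policy estimator is inverse propensity scoring (IPS). The sketch below shows generic IPS on logged (context, action, reward, logging probability) tuples. It is not one of the estimators proposed in the paper, and the log format, names, and data are assumptions made for illustration.

```python
def ips_value(logs, target_prob):
    """Inverse propensity scoring estimate of a target policy's value.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave the action.
    target_prob: function (context, action) -> probability under the target policy.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        total += target_prob(context, action) / logging_prob * reward
    return total / len(logs)


# Hypothetical logs: a uniform logging policy over two responses "a" and "b".
logs = [
    ("q", "a", 1.0, 0.5),
    ("q", "b", 0.0, 0.5),
    ("q", "a", 1.0, 0.5),
    ("q", "b", 0.0, 0.5),
]


def always_a(context, action):
    """Target policy that always responds with "a"."""
    return 1.0 if action == "a" else 0.0


print(ips_value(logs, always_a))
```

The estimate is unbiased when the logging probabilities are correct and the logging policy gives nonzero probability to every action the target policy can take.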

Results in LLMs

Optimal Design for Adaptive In-Context Prompt Selection in Large Language Models

We use active learning for adaptive prompt design and call it Active In-context Prompt Design (AIPD). We design the LLM prompt by adaptively choosing informative few-shot examples from a training set using Optimal Design, to optimize performance on a test set. We experiment on different tasks with small, medium, and large LLMs, and show that our proposed algorithms GO and SAL outperform other methods for choosing few-shot examples in the LLM prompt at inference. pdf

Transformers, Multi-task Learning and In-Context Learning

PredeTor Performance in GPT2

Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Learning

We study the multi-task RL problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure, and the algorithm exploits this shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer (CausalLM GPT2 model) as a decision-making algorithm to learn this shared structure implicitly so as to generalize to the test task. Our model outperforms other SOTA methods like DPT, and imitation learning algorithms like Algorithm Distillation (AD), over a series of experiments on several structured bandit problems. pdf

Representation Learning Performance

Multi-task Representation Learning for Pure Exploration in Bilinear Bandits (NeurIPS 2023)

We study multi-task representation learning for the problem of pure exploration in bilinear bandits. We aim to find optimal items for multiple tasks that share a common low-dimensional linear representation. We propose and analyze the algorithm GOBLIN that uses an Optimal Design approach to optimize sample allocations for learning the global representation as well as minimize the number of samples needed to identify the optimal pair of items in individual tasks. pdf

Reinforcement Learning

Speed performance

SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits (AISTATS 2024)

In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Our work is the first to focus on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. pdf

ReVar Performance

ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling (UAI 2022)

We study the problem of data collection for policy evaluation in Markov decision processes (MDPs). In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain in an environment formalized as an MDP. We develop and analyze the Reduced Variance Sampling (ReVar) algorithm, which approximates the oracle strategy when the reward variances are unknown a priori, and we bound its sub-optimality compared to the oracle strategy. Finally, we empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy. pdf

Active Learning Algorithm

Chernoff Sampling for Active Testing and Extension to Active Regression (AISTATS 2022)

Active learning can reduce the number of samples needed to perform a hypothesis test and to estimate the parameters of a model. We revisit the work of Chernoff that described an asymptotically optimal algorithm for performing a hypothesis test. We obtain a novel sample complexity bound for Chernoff’s algorithm, with a non-asymptotic term that characterizes its performance at a fixed confidence level. We also develop an extension of Chernoff sampling that can be used to estimate the parameters of a wide variety of models and we obtain a non-asymptotic bound on the estimation error. We apply our extension of Chernoff sampling to actively learn neural network models and to estimate parameters in real-data linear and non-linear regression problems, where our approach compares favorably with state-of-the-art methods. pdf

Safety in RL

SaVeR Performance

SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular MDP (ICML 2024)

We study safe data collection for the purpose of policy evaluation in tabular Markov decision processes (MDPs). While prior work has considered behavior policy selection, in this paper, we additionally consider a safety constraint on the behavior policy. We then introduce an algorithm SaVeR for this problem that approximates the (best possible) safe oracle algorithm and bound the finite-sample mean squared error of the algorithm while ensuring it satisfies the safety constraint. Finally, we show in simulations that SaVeR produces low MSE policy evaluation while satisfying the safety constraint. pdf

Safety Bandits

Safety Aware Changepoint Detection for Piecewise i.i.d. Bandits (UAI 2022)

We consider the setting of piecewise i.i.d. bandits under a safety constraint. In this setting, there exists a finite number of changepoints where the mean scores of some or all actions (items) change simultaneously. We propose two actively adaptive algorithms for this setting that satisfy the safety constraint, detect changepoints, and restart without knowledge of the number of changepoints or their locations. Empirically, we show that our safety-aware algorithms perform similarly to the SOTA adaptive algorithms that do not satisfy the safety constraint. pdf
